
    Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy

    Data collection for scientific applications is increasing exponentially and is forecast to soon reach peta- and exabyte scales. Applications that process and analyze scientific data must be scalable and focus on execution performance to keep pace. In the field of radio astronomy, in addition to increasingly large datasets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection written in Scala that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5X over a similar multithreaded implementation. Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution performance of the RandomForest machine learning algorithm by an average of 54% with less than a 2% average reduction in the algorithm's ability to correctly classify pulsars. The generalizability of these results is demonstrated by using two real-world radio astronomy data sets. Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 pages.

    A Brief Tour through Provenance in Scientific Workflows and Databases

    Within computer science, the term provenance has multiple meanings, due to different motivations, perspectives, and assumptions prevalent in the respective communities. This chapter provides a high-level “sightseeing tour” of some of those different notions and uses of provenance in scientific workflows and databases.

    On the Reusability of Data Cleaning Workflows

    The goal of data cleaning is to make data fit for purpose, i.e., to improve data quality, through updates and data transformations, such that downstream analyses can be conducted and lead to trustworthy results. A transparent and reusable data cleaning workflow can save time and effort through automation, and make subsequent data cleaning on new data less error-prone. However, reusability of data cleaning workflows has received little to no attention in the research community. We identify some challenges and opportunities for reusing data cleaning workflows. We present a high-level conceptual model to clarify what we mean by reusability and propose ways to improve reusability along different dimensions. We use the opportunity of presenting at IDCC to invite the community to share their use cases, experiences, and desiderata for the reuse of data cleaning workflows and recipes in order to foster new collaborations and guide future work.

    Games and Argumentation: Time for a Family Reunion!

    The rule "defeated(X) ← attacks(Y,X), ¬defeated(Y)" states that an argument is defeated if it is attacked by an argument that is not defeated. The rule "win(X) ← move(X,Y), ¬win(Y)" states that in a game a position is won if there is a move to a position that is not won. Both logic rules can be seen as close relatives (even identical twins) and both rules have been at the center of attention at various times in different communities: The first rule lies at the core of argumentation frameworks and has spawned a large family of models and semantics of abstract argumentation. The second rule has played a key role in the quest to find the "right" semantics for logic programs with recursion through negation, and has given rise to the stable and well-founded semantics. Both semantics have been widely studied by the logic programming and nonmonotonic reasoning community. The second rule has also received much attention from the database and finite model theory community, e.g., when studying the expressive power of query languages and fixpoint logics. Although close connections between argumentation frameworks, logic programming, and dialogue games have been known for a long time, the overlap and cross-fertilization between the communities appear to be smaller than one might expect. To this end, we recall some of the key results from database theory in which the win-move query has played a central role, e.g., on normal forms and expressive power of query languages. We introduce some notions that naturally emerge from games and that may provide new perspectives and research opportunities for argumentation frameworks. We discuss how solved query evaluation games reveal how- and why-not provenance of query answers. These techniques can be used to explain how results were derived via the given query, game, or argumentation framework. Comment: Fourth Workshop on Explainable Logic-Based Knowledge Representation (XLoKR), Sept 2, 2023. Rhodes, Greece.
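The win-move rule quoted in the abstract above can be illustrated with a small solver. The following is a minimal sketch, not the authors' implementation: it assumes a finite game given as a list of `(x, y)` move pairs and labels positions by backward induction, which on such games should agree with the three-valued well-founded reading of the rule (won = true, lost = false, drawn = undefined). The function name `solve_game` and the labels are hypothetical choices made for illustration.

```python
from collections import defaultdict


def solve_game(moves):
    """Label each position 'won', 'lost', or 'drawn' for the player to move.

    moves: iterable of (x, y) pairs meaning "there is a move from x to y".
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    positions = set()
    for x, y in moves:
        succ[x].add(y)
        pred[y].add(x)
        positions.update((x, y))

    label = {}
    # Number of successors of each position not yet known to be won.
    remaining = {p: len(succ[p]) for p in positions}

    # Positions without any move are lost: the player to move is stuck.
    queue = [p for p in positions if remaining[p] == 0]
    for p in queue:
        label[p] = "lost"

    while queue:
        p = queue.pop()
        for q in pred[p]:
            if q in label:
                continue
            if label[p] == "lost":
                # q has a move to a lost position, so q is won.
                label[q] = "won"
                queue.append(q)
            else:
                # p is won; q is lost once all of its moves lead to won positions.
                remaining[q] -= 1
                if remaining[q] == 0:
                    label[q] = "lost"
                    queue.append(q)

    # Anything left unlabeled can neither force a win nor be forced to lose.
    for p in positions:
        label.setdefault(p, "drawn")
    return label


example_moves = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "d"), ("f", "c"), ("f", "d")]
example_labels = solve_game(example_moves)
```

Under the correspondence the abstract describes, reading `move(X,Y)` as `attacks(Y,X)` would let the same sketch label arguments, with "won" playing the role of "defeated"; that mapping is stated here as an assumption drawn from the two rules, not as a result from the paper.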

    Automatic Module Detection in Data Cleaning Workflows: Enabling Transparency and Recipe Reuse

    Before data from multiple sources can be analyzed, data cleaning workflows (“recipes”) usually need to be employed to improve data quality. We identify a number of technical problems that make application of FAIR principles to data cleaning recipes challenging. We then demonstrate how transparency and reusability of recipes can be improved by analyzing dataflow dependencies within recipes. In particular, column-level dependencies can be used to automatically detect independent subworkflows, which then can be reused individually as data cleaning modules. We have prototypically implemented this approach as part of an ongoing project to develop open-source companion tools for OpenRefine. Keywords: Data Cleaning, Provenance, Workflow Analysis
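The idea of detecting independent subworkflows from column-level dependencies can be sketched as a connected-components computation. This is a hedged illustration only: the abstract does not describe the actual algorithm, and the recipe representation below (a list of `(step_name, columns_touched)` pairs) is a hypothetical simplification, not OpenRefine's recipe format. Steps that touch a common column, directly or transitively, are grouped into one module.

```python
def independent_modules(steps):
    """Group recipe steps into modules that share no columns.

    steps: list of (name, columns) pairs, where columns is the set of column
    names a step reads or writes (an assumed, simplified representation).
    Returns a list of modules, each a list of step names in recipe order.
    """
    parent = list(range(len(steps)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_step_for_column = {}
    for i, (_, columns) in enumerate(steps):
        for col in columns:
            # Link this step to the first step that touched the same column.
            j = first_step_for_column.setdefault(col, i)
            parent[find(i)] = find(j)

    modules = {}
    for i, (name, _) in enumerate(steps):
        modules.setdefault(find(i), []).append(name)
    return list(modules.values())


example_recipe = [
    ("trim_name", {"name"}),
    ("upper_name", {"name"}),
    ("parse_date", {"date"}),
    ("split_name", {"name", "first", "last"}),
    ("format_date", {"date"}),
]
example_modules = independent_modules(example_recipe)
```

In this example the name-related steps and the date-related steps end up in separate modules, since no column connects them. A fuller treatment would likely distinguish reads from writes and respect step order, which this sketch deliberately ignores.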

    Exploring Geopolitical Realities through Taxonomies: The Case of Taiwan

    In the face of heterogeneous standards and large-scale datasets, it has become increasingly difficult to understand the underlying knowledge structures within complex information systems. These structures may encode latent assumptions that could be susceptible to issues such as ghettoization, bias, erasure, or omission. Inspired by a series of current events in the China-Taiwan conflict on the sovereignty of Taiwan, our research aims to develop methods that can elucidate multiple, often conflicting perspectives and hidden assumptions. We propose the use of a logic-based taxonomy alignment approach to first align and then reconcile distinct but overlapping taxonomies. We specifically examine three relevant taxonomies that list world entities: (1) ISO 3166 for country codes and subdivisions; (2) the geographic regions of the US Department of Homeland Security; (3) the Central Intelligence Agency’s World Fact Book. Our results highlight multiple alternate views (or Possible Worlds) for situating Taiwan relative to other neighboring entities. We hope that this work can be a first step to demonstrate how different geopolitical perspectives can be represented using multiple, interrelated taxonomies.

    Full of beans: a study on the alignment of two flowering plants classification systems

    Advancements in technologies such as DNA analysis have given rise to new ways of organizing organisms in biodiversity classification systems. In this paper, we examine the feasibility of aligning two classification systems for flowering plants using a logic-based, Region Connection Calculus (RCC-5) approach. The older “Cronquist system” (1981) classifies plants using their morphological features, while the more recent Angiosperm Phylogeny Group IV (APG IV) (2016) system classifies based on many new methods including genome-level analysis. In our approach, we align pairwise concepts X and Y from two taxonomies using five basic set relations: congruence (X=Y), inclusion (X>Y), inverse inclusion (X<Y), overlap (X><Y), and disjointness (X!Y). With some of the RCC-5 relationships among the Fabaceae family (bean family) and the Sapindaceae family (maple family) uncertain, we anticipate that the merging of the two classification systems will lead to numerous merged solutions, so-called possible worlds. Our research demonstrates how logic-based alignment with ambiguities can lead to multiple merged solutions, which would not have been feasible when aligning taxonomies, classifications, or other knowledge organization systems (KOS) manually. We believe that this work can introduce a novel approach for aligning KOS, where merged possible worlds can serve as a minimum viable product for engaging domain experts in the loop.
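The RCC-5 relations named in the abstract above can be made concrete when two concepts are modeled as non-empty sets of members. The sketch below is an illustration under that assumption, not the alignment tool used in the paper: real alignments reason over relations whose extensions are unknown, whereas this function simply classifies two known sets. RCC-5 has five base relations, the fifth being partial overlap, written here as `><`; the function name and the sample taxa sets are hypothetical.

```python
def rcc5(x, y):
    """Return the RCC-5 relation between two non-empty sets x and y.

    '='  congruence, '>' inclusion (x properly contains y),
    '<'  inverse inclusion (x properly contained in y),
    '><' partial overlap, '!' disjointness.
    """
    x, y = set(x), set(y)
    if not x or not y:
        raise ValueError("RCC-5 regions are assumed to be non-empty")
    if x == y:
        return "="
    if x > y:
        return ">"
    if x < y:
        return "<"
    if x & y:
        return "><"
    return "!"


# Hypothetical member sets, chosen only to exercise each relation.
old_concept = {"t1", "t2", "t3"}
new_concept = {"t2", "t3", "t4"}
example_relation = rcc5(old_concept, new_concept)
```

Exactly one of the five relations holds for any pair of non-empty sets, which is what makes them usable as a base vocabulary; an uncertain alignment can then be expressed as a disjunction of several of them, each choice leading to a different possible world.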

    Workflows and Provenance: Toward Information Science Solutions for the Natural Sciences

    The era of big data and ubiquitous computation has brought with it concerns about ensuring reproducibility in this new research environment. It is easy to assume that computational methods self-document by their very nature of being exact, deterministic processes. However, similar to laboratory experiments, ensuring reproducibility in the computational realm requires the documentation of both the protocols used (workflows) and a detailed description of the computational environment: algorithms, implementations, software environments, and the data ingested and execution logs of the computation. These two aspects of computational reproducibility (workflows and execution details) are discussed within the context of biomolecular Nuclear Magnetic Resonance spectroscopy (bioNMR), as well as the PRIMAD model for computational reproducibility.